The dataset consists of feature vectors for 12,330 sessions. It was formed so that each session belongs to a different user over a one-year period, to avoid any bias toward a specific campaign, special day, user profile, or period.
The dataset consists of 10 numerical and 8 categorical attributes. The 'Revenue' attribute is used as the target variable.
The values of these features are derived from the URL information of the pages visited by the user and are updated in real time when the user takes an action.
Administrative, Informational, ProductRelated - the number of pages of each type visited by the visitor in that session.
Administrative_Duration, Informational_Duration, ProductRelated_Duration - total time spent in each of these page categories.
The following metrics are measured by Google Analytics for each page of the e-commerce site:
Bounce Rate - for a web page, the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.
Exit Rate - for a specific web page, the percentage of all pageviews to that page that were the last in the session.
Page Value - the average value of a web page that a user visited before completing an e-commerce transaction.
Special Day - the closeness of the visit time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to end with a transaction. The dataset also includes operating system, browser, region, traffic type, visitor type (returning or new), a Boolean value indicating whether the visit falls on a weekend, and the month of the year.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler,MinMaxScaler,PowerTransformer
from sklearn.model_selection import train_test_split,cross_val_score,KFold,GridSearchCV
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.metrics import r2_score,f1_score,accuracy_score,roc_curve,roc_auc_score,confusion_matrix,classification_report
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
import scipy.stats as stats
import statsmodels.api as sm
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
import warnings
warnings.filterwarnings('ignore')
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFE
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
df = pd.read_csv('online_shoppers_intention.csv')
df.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |
print ('Number of rows -',df.shape[0])
print ('Number of columns -',df.shape[1])
Number of rows - 12330
Number of columns - 18
categorical_columns=df.select_dtypes(include='object').columns
numerical_columns=df.select_dtypes(exclude='object').columns
print('categorical columns:',categorical_columns)
print('numerical columns:',numerical_columns)
categorical columns: Index(['Month', 'VisitorType'], dtype='object')
numerical columns: Index(['Administrative', 'Administrative_Duration', 'Informational',
'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay',
'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'Weekend',
'Revenue'],
dtype='object')
for i in df.select_dtypes(include='object'):
    print('no of categories:',i,'is',len(df[i].value_counts()))
no of categories: Month is 10
no of categories: VisitorType is 3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Administrative           12330 non-null  int64
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object
 11  OperatingSystems         12330 non-null  int64
 12  Browser                  12330 non-null  int64
 13  Region                   12330 non-null  int64
 14  TrafficType              12330 non-null  int64
 15  VisitorType              12330 non-null  object
 16  Weekend                  12330 non-null  bool
 17  Revenue                  12330 non-null  bool
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.5+ MB
# Bool - Weekend, Revenue
# Object - Month, VisitorType
# Numerical - Administrative, Administrative_Duration, Informational, Informational_Duration,
# ProductRelated, ProductRelated_Duration, BounceRates, ExitRates, PageValues, SpecialDay,
# OperatingSystems, Browser, Region, TrafficType
# 5 point summary
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Administrative | 12330.0 | 2.315166 | 3.321784 | 0.0 | 0.000000 | 1.000000 | 4.000000 | 27.000000 |
| Administrative_Duration | 12330.0 | 80.818611 | 176.779107 | 0.0 | 0.000000 | 7.500000 | 93.256250 | 3398.750000 |
| Informational | 12330.0 | 0.503569 | 1.270156 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 24.000000 |
| Informational_Duration | 12330.0 | 34.472398 | 140.749294 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 2549.375000 |
| ProductRelated | 12330.0 | 31.731468 | 44.475503 | 0.0 | 7.000000 | 18.000000 | 38.000000 | 705.000000 |
| ProductRelated_Duration | 12330.0 | 1194.746220 | 1913.669288 | 0.0 | 184.137500 | 598.936905 | 1464.157213 | 63973.522230 |
| BounceRates | 12330.0 | 0.022191 | 0.048488 | 0.0 | 0.000000 | 0.003112 | 0.016813 | 0.200000 |
| ExitRates | 12330.0 | 0.043073 | 0.048597 | 0.0 | 0.014286 | 0.025156 | 0.050000 | 0.200000 |
| PageValues | 12330.0 | 5.889258 | 18.568437 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 361.763742 |
| SpecialDay | 12330.0 | 0.061427 | 0.198917 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| OperatingSystems | 12330.0 | 2.124006 | 0.911325 | 1.0 | 2.000000 | 2.000000 | 3.000000 | 8.000000 |
| Browser | 12330.0 | 2.357097 | 1.717277 | 1.0 | 2.000000 | 2.000000 | 2.000000 | 13.000000 |
| Region | 12330.0 | 3.147364 | 2.401591 | 1.0 | 1.000000 | 3.000000 | 4.000000 | 9.000000 |
| TrafficType | 12330.0 | 4.069586 | 4.025169 | 1.0 | 2.000000 | 2.000000 | 4.000000 | 20.000000 |
# The large gap between the mean and max of the duration features shows the data is skewed
# Product-related pages are visited more than any other page type
# Special days have relatively little influence on site visits
df.isnull().sum()
Administrative             0
Administrative_Duration    0
Informational              0
Informational_Duration     0
ProductRelated             0
ProductRelated_Duration    0
BounceRates                0
ExitRates                  0
PageValues                 0
SpecialDay                 0
Month                      0
OperatingSystems           0
Browser                    0
Region                     0
TrafficType                0
VisitorType                0
Weekend                    0
Revenue                    0
dtype: int64
# There are no null values in the dataset
df.skew()
Administrative             1.960357
Administrative_Duration    5.615719
Informational              4.036464
Informational_Duration     7.579185
ProductRelated             4.341516
ProductRelated_Duration    7.263228
BounceRates                2.947855
ExitRates                  2.148789
PageValues                 6.382964
SpecialDay                 3.302667
OperatingSystems           2.066285
Browser                    3.242350
Region                     0.983549
TrafficType                1.962987
Weekend                    1.265962
Revenue                    1.909509
dtype: float64
# Some features are highly skewed - the duration features and PageValues
# Some features are moderately skewed - BounceRates and ExitRates
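As a quick illustration of how a log transform tames this kind of right skew, here is a minimal sketch on simulated data (the lognormal "duration" series below is an illustrative assumption, not drawn from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated right-skewed "duration" feature (illustrative stand-in only)
duration = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=2000))

# log1p handles the zeros that real duration columns contain
print('skew before:', round(duration.skew(), 3))
print('skew after :', round(np.log1p(duration).skew(), 3))
```

The same idea underlies the PowerTransformer applied later in this notebook, which chooses the transform parameter automatically per column.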
plt.figure(figsize=(100,95))
with plt.style.context({'axes.labelsize':100,
                        'xtick.labelsize':90,
                        'ytick.labelsize':90}):
    ax=sns.heatmap(df.corr()[df.corr()>0.4],annot=True,annot_kws={"size":80},fmt='.2f')
# THERE IS MULTICOLLINEARITY among some features
# PageValues has a moderate correlation with the Revenue column,
# i.e. pages with higher page values are associated with revenue-generating sessions
sns.pairplot(df,diag_kind='kde',hue='Revenue')
<seaborn.axisgrid.PairGrid at 0x1f13f173520>
# There is overlap between the classes in this data
#numerical features
num=['Administrative_Duration','Informational_Duration','ProductRelated_Duration','BounceRates','ExitRates','PageValues']
df.hist(column=num)
plt.tight_layout()
# We can see right-skewed distributions
# There are outliers in ExitRates and BounceRates, indicating some unusual or rare sessions
col=['Administrative','Informational','ProductRelated','SpecialDay','OperatingSystems','Browser','Region','TrafficType','Month','VisitorType','Weekend']
c=1
plt.figure(figsize=(15,10))
for i in col:
    ax=plt.subplot(6,2,c)
    sns.countplot(x=df[i],ax=ax)
    c=c+1
plt.tight_layout()
# Product-related pages get the most repeat visits
# Operating system 2 has the highest usage
# Region 1 contributes the highest number of visitors
# March, May, November and December show more visitors than the other months, suggesting some seasonality in visits
# Weekdays bring more online-shopping visitors than weekends
# There are more returning visitors than new visitors; this implies we should do more marketing to attract new visitors
#how features are having relation with target i.e revenue
num=['Administrative_Duration','Informational_Duration','ProductRelated_Duration','BounceRates','ExitRates','PageValues']
c=0
plt.figure(figsize=(30,45))
with plt.style.context({'axes.labelsize':24,
                        'xtick.labelsize':24,
                        'ytick.labelsize':24}):
    for i in num:
        c=c+1
        ax=plt.subplot(6,2,c)
        sns.boxplot(y=df[i],x=df['Revenue'],ax=ax)
        ax.set_title('Revenue generated based on '+i)
plt.tight_layout()
# Activity on administrative pages is slightly higher in revenue sessions than in non-revenue sessions,
# suggesting the page is somewhat profitable; the same holds for informational pages
# For product-related pages, revenue and non-revenue sessions look similar
# Sessions with high page values convert to revenue slightly more often than sessions with low page values
col=['Administrative','Informational','ProductRelated','SpecialDay','OperatingSystems','Browser','Region','TrafficType','Month','VisitorType','Weekend']
c=0
plt.figure(figsize=(30,45))
with plt.style.context({'axes.labelsize':24,
                        'xtick.labelsize':24,
                        'ytick.labelsize':24}):
    for i in col:
        c=c+1
        ax=plt.subplot(6,2,c)
        sns.countplot(x=df[i],hue=df['Revenue'],ax=ax)
plt.tight_layout()
for i in num:
    m=np.mean(df[i])
    s=np.std(df[i])
    thresh=3
    z=(df[i]-m)/s
    outlier=[x for x in np.abs(z) if x>thresh]
    print(i)
    print('outlier count',len(outlier))
    print('percentage of outliers',len(outlier)/df[i].shape[0]*100)
    print()
Administrative_Duration
outlier count 232
percentage of outliers 1.8815896188158963

Informational_Duration
outlier count 230
percentage of outliers 1.8653690186536902

ProductRelated_Duration
outlier count 219
percentage of outliers 1.7761557177615572

BounceRates
outlier count 708
percentage of outliers 5.742092457420925

ExitRates
outlier count 713
percentage of outliers 5.78264395782644

PageValues
outlier count 259
percentage of outliers 2.1005677210056772
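The z-score rule above assumes roughly bell-shaped data; for skewed features like these, an IQR-based fence is a common alternative. A small sketch on synthetic data (the series below is illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Mostly well-behaved values plus a few planted extremes
s = pd.Series(np.concatenate([rng.normal(50, 5, size=995),
                              [200, 210, 220, 230, 240]]))

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
# Tukey's fences: flag anything beyond 1.5 * IQR from the quartiles
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
print('IQR outlier count:', int(mask.sum()))
```

Because the quartiles are robust to extreme values, this fence is less distorted by the very outliers it is trying to detect than a mean/std-based z-score.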
df.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |
df1=df.copy()
df1.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |
# Label encoding
from sklearn.preprocessing import LabelEncoder
LE=LabelEncoder()
df1['Weekend']=LE.fit_transform(df1['Weekend'])
df1['Revenue']=LE.fit_transform(df1['Revenue'])
df1.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | 1 | 0 |
df1.dtypes
Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
VisitorType                 object
Weekend                      int64
Revenue                      int64
dtype: object
for i in ['Month','OperatingSystems','Browser','Region','TrafficType','VisitorType']:
    df1[i]=df1[i].astype('object')
cat=['Month','OperatingSystems','Browser','Region','TrafficType','VisitorType']
dummy=pd.get_dummies(df1[cat],drop_first=True)
df1=df1.drop(columns=['Month','OperatingSystems','Browser','Region','TrafficType','VisitorType'])
df1.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | 1 | 0 |
df_Prepared=pd.concat([df1,dummy],axis=1)
df_Prepared.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | ... | TrafficType_13 | TrafficType_14 | TrafficType_15 | TrafficType_16 | TrafficType_17 | TrafficType_18 | TrafficType_19 | TrafficType_20 | VisitorType_Other | VisitorType_Returning_Visitor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 69 columns
sns.heatmap(df_Prepared[num].corr(),annot=True)
<AxesSubplot:>
# BounceRates and ExitRates are strongly correlated with each other; the remaining numerical features show little multicollinearity
sns.countplot(x=df_Prepared['Revenue'])
<AxesSubplot:xlabel='Revenue', ylabel='count'>
# Yes, there is class imbalance in the dataset; we can use oversampling or undersampling techniques such as SMOTE or NearMiss
# We treat the class imbalance with SMOTE after the train-test split,
# resampling only the training data so that during model building
# the model is not biased toward one particular class simply because it has more records,
# since that would affect the accuracy of the predictions
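SMOTE (used later) synthesizes new minority-class points; a simpler baseline with the same "train-only" discipline is plain random oversampling via `sklearn.utils.resample`. A minimal sketch on a toy frame (the `feat`/`Revenue` toy data is hypothetical, standing in for xtrain/ytrain):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced training set (hypothetical stand-in for xtrain/ytrain)
train = pd.DataFrame({'feat': range(10),
                      'Revenue': [0] * 8 + [1] * 2})

majority = train[train.Revenue == 0]
minority = train[train.Revenue == 1]

# Draw minority rows with replacement until both classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=10)
balanced = pd.concat([majority, minority_up])
print(balanced['Revenue'].value_counts())
```

Unlike SMOTE, this only duplicates existing rows, so it risks overfitting to repeated minority samples; either way, the test set is left untouched.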
shapiro=pd.DataFrame(index=df.select_dtypes(include='number').columns,columns=['Shapiro p value'])
for i in df.select_dtypes(include='number').columns:
    Tstat,pvalue=stats.shapiro(df[i])
    shapiro.loc[i,'Shapiro p value']=pvalue
shapiro
| | Shapiro p value |
|---|---|
| Administrative | 0 |
| Administrative_Duration | 0 |
| Informational | 0 |
| Informational_Duration | 0 |
| ProductRelated | 0 |
| ProductRelated_Duration | 0 |
| BounceRates | 0 |
| ExitRates | 0 |
| PageValues | 0 |
| SpecialDay | 0 |
| OperatingSystems | 0 |
| Browser | 0 |
| Region | 0 |
| TrafficType | 0 |
# Data is not normally distributed (all Shapiro p-values are ~0)
# so we go with non-parametric tests
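To see how the Shapiro-Wilk test behaves, here is a quick check on synthetic samples (illustrative only, not from the dataset): a normal sample should not be rejected, while a clearly skewed one should be.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=500)       # should look normal
skewed_sample = rng.exponential(size=500)  # clearly non-normal

_, p_normal = stats.shapiro(normal_sample)
_, p_skewed = stats.shapiro(skewed_sample)
print('normal p:', p_normal)
print('skewed p:', p_skewed)
```

A tiny p-value rejects normality, which is exactly what happens for every numerical column in this dataset.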
from sklearn.preprocessing import LabelEncoder
LE=LabelEncoder()
df['Weekend']=LE.fit_transform(df['Weekend'])
df['Revenue']=LE.fit_transform(df['Revenue'])
Mwu=pd.DataFrame(index=df.select_dtypes(include='number').drop(columns='Revenue').columns,columns=['pval'])
for i in df.select_dtypes(include='number').drop(columns='Revenue').columns:
    group_1=df[i][df.Revenue==0]
    group_2=df[i][df.Revenue==1]
    hstat,pval=stats.mannwhitneyu(group_1,group_2)
    Mwu.loc[i,'pval']=pval
Mwu
Mwu
| | pval |
|---|---|
| Administrative | 6.55966e-78 |
| Administrative_Duration | 2.33613e-74 |
| Informational | 6.01391e-37 |
| Informational_Duration | 7.04519e-36 |
| ProductRelated | 5.26219e-108 |
| ProductRelated_Duration | 2.66613e-128 |
| BounceRates | 9.16385e-62 |
| ExitRates | 5.78412e-176 |
| PageValues | 0 |
| SpecialDay | 2.53978e-22 |
| OperatingSystems | 0.00120423 |
| Browser | 0.035731 |
| Region | 0.050246 |
| TrafficType | 0.447477 |
| Weekend | 0.000571296 |
# Region (p ≈ 0.050) and TrafficType (p ≈ 0.447) are not statistically significant; the remaining features are significant in predicting the target variable
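For intuition on the Mann-Whitney U test used above: it compares the rank distributions of the two Revenue groups without assuming normality. A sketch on synthetic groups (the exponential samples are illustrative stand-ins, e.g. for PageValues split by Revenue class):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two groups drawn from clearly different distributions
group_1 = rng.exponential(scale=1.0, size=300)
group_2 = rng.exponential(scale=2.0, size=300)

stat, pval = stats.mannwhitneyu(group_1, group_2)
print('Mann-Whitney U p-value:', pval)
```

A small p-value says the two groups differ in distribution, i.e. the feature separates the classes to some degree.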
chisquare=pd.DataFrame(index=df.select_dtypes(include='object').columns,columns=['pval'])
for i in df.select_dtypes(include='object').columns:
    chi,pval,dof,expected=stats.chi2_contingency(pd.crosstab(index=df[i],columns=df['Revenue']))
    chisquare.loc[i,'pval']=pval
chisquare
| | pval |
|---|---|
| Month | 2.23879e-77 |
| VisitorType | 4.2699e-30 |
# If p-value < 0.05, the feature is statistically significant for predicting the target class,
# i.e. the feature is associated with the target column.
# Though some of the columns are statistically insignificant, we will still use them for model prediction,
# as they may carry some sensitive information.
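For reference, the chi-square test above operates on a contingency table of category counts versus Revenue. A minimal sketch with a hypothetical table (the counts below are made up for illustration, not taken from the dataset):

```python
import pandas as pd
from scipy import stats

# Hypothetical visitor-type vs revenue counts (not from the dataset)
table = pd.DataFrame({'no_revenue': [900, 100],
                      'revenue':    [100, 50]},
                     index=['Returning_Visitor', 'New_Visitor'])

chi2, pval, dof, expected = stats.chi2_contingency(table)
print('p-value:', pval, 'dof:', dof)
```

A small p-value indicates the categorical feature and Revenue are associated, which is how Month and VisitorType qualify above.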
x = df_Prepared.drop(columns='Revenue')
y = df_Prepared['Revenue']
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=10)
mu = x.mean()
tstat, pval = stats.ttest_1samp(xtrain.select_dtypes(exclude = 'object'),popmean = mu)
print(pval)
[0.93509308 0.72917488 0.75990768 0.75366495 0.36404401 0.47110559 0.90074881 0.88713106 0.64765601 0.41075795 0.79142293 0.43600027 0.9155511 0.61943922 0.32111138 0.45512202 0.90007559 0.37715726 0.32837386 0.972444 0.36098451 0.43058388 0.79626568 0.92033285 0.36909608 0.65329631 0.45226993 0.64732773 0.50314526 0.92022868 0.50259395 0.72468634 0.77648644 0.34870198 0.76418438 0.91704688 0.92033285 0.13358344 0.34773912 0.71682786 0.4260716 0.38621059 0.71704558 0.56074211 0.58624303 0.38587667 0.98708001 0.74821969 0.67434538 0.80886463 0.76686354 0.77996066 0.58938001 0.9014143 0.37821994 0.6500734 0.4829283 0.76418438 0.76321397 0.69723651 0.79102351 0.60330312 0.76418438 1. 0.97695624 0.97271677 0.84335225 0.75155491]
# All the p-values > 0.05. Hence the train set is representative of the original data
mu = x.mean()
tstat, pval = stats.ttest_1samp(xtest.select_dtypes(exclude = 'object'),popmean = mu)
print(pval)
[0.90013339 0.57080859 0.61983673 0.62822478 0.11722141 0.18569464 0.84890227 0.82791604 0.51799135 0.18685957 0.68446715 0.24747657 0.8693605 0.42676894 0.08668858 0.26591331 0.84756588 0.16906694 0.1654201 0.95805218 0.16388188 0.23802921 0.69961904 0.88752968 0.0089159 0.27140372 0.32325032 0.48688353 0.36360574 0.87923104 0.27651118 0.60895088 0.63676763 0.08675704 0. 0.87556511 0.88752968 0.22043719 0.24322794 0.57249629 0.23416428 0.20414865 0.5645247 0.35615514 0.38854935 0.15001619 0.98024141 0.62236978 0.51530229 0.71524926 0.63863336 0.6771496 0.31685198 0.84801412 0.28492644 0.47015752 0.32325822 0. 0.65189914 0.62260632 0.65760463 0. 0. 1. 0.96431253 0.95814204 0.77206557 0.6326051 ]
# Most p-values > 0.05, so the test set is broadly representative of the original data (a few rare dummy columns deviate)
print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)
(8631, 68)
(8631,)
(3699, 68)
(3699,)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
x_scaled = pd.DataFrame(pt.fit_transform(xtrain),columns=xtrain.columns)
xtest_scaled=pd.DataFrame(pt.transform(xtest),columns=xtest.columns)
x_scaled.skew()
Administrative 0.247594
Administrative_Duration 0.150628
Informational 1.419061
Informational_Duration 1.559372
ProductRelated -0.001746
...
TrafficType_18 35.077464
TrafficType_19 26.767526
TrafficType_20 7.689632
VisitorType_Other 12.077575
VisitorType_Returning_Visitor -2.038895
Length: 68, dtype: float64
xtest_scaled.skew()
Administrative 0.223879
Administrative_Duration 0.123979
Informational 1.370144
Informational_Duration 1.518069
ProductRelated -0.066765
...
TrafficType_18 35.085597
TrafficType_19 27.155098
TrafficType_20 7.730432
VisitorType_Other 11.580851
VisitorType_Returning_Visitor -1.993811
Length: 68, dtype: float64
# The skewness is treated by Power Transformer
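To make the effect of the PowerTransformer concrete, here is a self-contained sketch on simulated lognormal data (illustrative only): the Yeo-Johnson transform it applies by default pulls a strongly right-skewed column toward symmetry.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # strongly right-skewed

pt = PowerTransformer()  # Yeo-Johnson by default, with standardization
X_t = pt.fit_transform(X)

print('skew before:', round(float(skew(X[:, 0])), 3))
print('skew after :', round(float(skew(X_t[:, 0])), 3))
```

As in the notebook, the transformer is fit on training data only and then applied to the test split with `transform`, so no test-set information leaks into the fitted parameters.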
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
# Logistic Regression
x=sm.add_constant(xtrain)
y=ytrain
log1=sm.Logit(y,x).fit(method='cg')
print(log1.summary())
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.595673
Iterations: 35
Function evaluations: 79
Gradient evaluations: 71
Logit Regression Results
==============================================================================
Dep. Variable: Revenue No. Observations: 8631
Model: Logit Df Residuals: 8563
Method: MLE Df Model: 67
Date: Mon, 18 Oct 2021 Pseudo R-squ.: -0.3899
Time: 15:19:25 Log-Likelihood: -5141.3
converged: False LL-Null: -3698.9
Covariance Type: nonrobust LLR p-value: 1.000
=================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------
const -0.0110 0.189 -0.058 0.953 -0.381 0.359
Administrative -0.0056 0.014 -0.406 0.685 -0.033 0.021
Administrative_Duration -0.0026 0.000 -8.019 0.000 -0.003 -0.002
Informational -0.0006 0.032 -0.018 0.986 -0.063 0.061
Informational_Duration 0.0001 0.000 0.503 0.615 -0.000 0.001
ProductRelated -0.0417 0.002 -18.914 0.000 -0.046 -0.037
ProductRelated_Duration 0.0004 3.7e-05 9.802 0.000 0.000 0.000
BounceRates -0.0004 1.189 -0.000 1.000 -2.332 2.331
ExitRates -0.0008 1.312 -0.001 1.000 -2.572 2.570
PageValues 0.0885 0.004 24.426 0.000 0.081 0.096
SpecialDay -0.0010 0.141 -0.007 0.994 -0.278 0.276
Weekend -0.0024 0.061 -0.038 0.969 -0.122 0.118
Month_Dec -0.0017 0.159 -0.011 0.991 -0.314 0.310
Month_Feb -0.0003 0.235 -0.001 0.999 -0.460 0.459
Month_Jul -0.0004 0.195 -0.002 0.998 -0.382 0.381
Month_June -0.0003 0.211 -0.002 0.999 -0.414 0.414
Month_Mar -0.0023 0.156 -0.015 0.988 -0.309 0.304
Month_May -0.0038 0.152 -0.025 0.980 -0.302 0.295
Month_Nov -0.0012 0.153 -0.008 0.994 -0.302 0.299
Month_Oct -0.0003 0.192 -0.002 0.999 -0.377 0.377
Month_Sep -0.0004 0.195 -0.002 0.999 -0.383 0.382
OperatingSystems_2 -0.0052 0.143 -0.037 0.971 -0.285 0.274
OperatingSystems_3 -0.0027 0.149 -0.018 0.986 -0.295 0.290
OperatingSystems_4 -0.0004 0.148 -0.002 0.998 -0.291 0.290
OperatingSystems_5 -3.642e-06 7.46e+06 -4.88e-13 1.000 -1.46e+07 1.46e+07
OperatingSystems_6 -2.614e-05 0.570 -4.59e-05 1.000 -1.117 1.117
OperatingSystems_7 -1.422e-05 0.853 -1.67e-05 1.000 -1.672 1.672
OperatingSystems_8 -6.723e-05 0.505 -0.000 1.000 -0.991 0.990
Browser_2 -0.0069 0.143 -0.048 0.962 -0.286 0.273
Browser_3 -0.0001 0.307 -0.000 1.000 -0.601 0.601
Browser_4 -0.0006 0.179 -0.003 0.997 -0.351 0.349
Browser_5 -0.0004 0.190 -0.002 0.998 -0.373 0.372
Browser_6 -0.0002 0.255 -0.001 0.999 -0.500 0.500
Browser_7 -4.713e-05 0.396 -0.000 1.000 -0.776 0.776
Browser_8 -0.0002 0.233 -0.001 0.999 -0.458 0.457
Browser_9 -1.221e-06 2.304 -5.3e-07 1.000 -4.517 4.517
Browser_10 -0.0001 0.261 -0.000 1.000 -0.513 0.512
Browser_11 -3.642e-06 7.46e+06 -4.88e-13 1.000 -1.46e+07 1.46e+07
Browser_12 1.684e-06 1.117 1.51e-06 1.000 -2.189 2.189
Browser_13 -5.13e-05 0.708 -7.25e-05 1.000 -1.388 1.387
Region_2 -0.0010 0.092 -0.010 0.992 -0.182 0.180
Region_3 -0.0022 0.070 -0.031 0.976 -0.140 0.136
Region_4 -0.0010 0.092 -0.011 0.991 -0.182 0.180
Region_5 -0.0003 0.161 -0.002 0.999 -0.315 0.315
Region_6 -0.0008 0.104 -0.008 0.994 -0.204 0.202
Region_7 -0.0007 0.108 -0.007 0.995 -0.213 0.212
Region_8 -0.0005 0.134 -0.003 0.997 -0.264 0.263
Region_9 -0.0004 0.136 -0.003 0.997 -0.266 0.265
TrafficType_2 -0.0023 0.078 -0.029 0.977 -0.156 0.151
TrafficType_3 -0.0026 0.083 -0.031 0.975 -0.166 0.161
TrafficType_4 -0.0010 0.110 -0.009 0.993 -0.217 0.215
TrafficType_5 -0.0002 0.188 -0.001 0.999 -0.368 0.368
TrafficType_6 -0.0004 0.145 -0.003 0.998 -0.284 0.283
TrafficType_7 -5.695e-06 0.461 -1.24e-05 1.000 -0.904 0.904
TrafficType_8 -0.0001 0.168 -0.001 0.999 -0.328 0.328
TrafficType_9 -4.551e-05 0.438 -0.000 1.000 -0.858 0.858
TrafficType_10 -0.0003 0.145 -0.002 0.998 -0.284 0.283
TrafficType_11 -0.0002 0.194 -0.001 0.999 -0.381 0.381
TrafficType_12 -2.72e-06 2.011 -1.35e-06 1.000 -3.941 3.941
TrafficType_13 -0.0009 0.120 -0.008 0.994 -0.236 0.234
TrafficType_14 -3.397e-06 0.814 -4.17e-06 1.000 -1.596 1.596
TrafficType_15 -6.628e-05 0.417 -0.000 1.000 -0.816 0.816
TrafficType_16 -1.449e-07 1.388 -1.04e-07 1.000 -2.721 2.721
TrafficType_17 -2.701e-06 2.014 -1.34e-06 1.000 -3.946 3.946
TrafficType_18 -1.531e-05 0.820 -1.87e-05 1.000 -1.607 1.607
TrafficType_19 -2.042e-05 0.678 -3.01e-05 1.000 -1.328 1.328
TrafficType_20 -0.0002 0.234 -0.001 0.999 -0.459 0.459
VisitorType_Other -0.0001 0.417 -0.000 1.000 -0.817 0.817
VisitorType_Returning_Visitor -0.0099 0.083 -0.119 0.905 -0.173 0.153
=================================================================================================
lr = LogisticRegression()
lr.fit(x_scaled, ytrain)
pred_lr = lr.predict(xtest_scaled)
print (classification_report(ytest, pred_lr))
precision recall f1-score support
0 0.93 0.95 0.94 3115
1 0.71 0.61 0.65 584
accuracy 0.90 3699
macro avg 0.82 0.78 0.80 3699
weighted avg 0.89 0.90 0.90 3699
kf = KFold (n_splits =5,shuffle = True,random_state = 10)
score_lr = cross_val_score (lr, x_scaled, ytrain,cv=kf, scoring = 'roc_auc')
print ('Bias score:',np.mean(score_lr))
print ('Variance error:',np.std(score_lr,ddof=1))
Bias score: 0.9117088872368054
Variance error: 0.00263945818000346
# Naive Bayes
gnb = GaussianNB()
gnb.fit (x_scaled, ytrain)
pred_gnb = gnb.predict(xtest_scaled)
print (classification_report (ytest, pred_gnb))
precision recall f1-score support
0 0.96 0.04 0.08 3115
1 0.16 0.99 0.28 584
accuracy 0.19 3699
macro avg 0.56 0.52 0.18 3699
weighted avg 0.83 0.19 0.11 3699
kf = KFold (n_splits =5, shuffle = True, random_state = 10)
score_gnb = cross_val_score (gnb, x_scaled, ytrain,cv=kf,scoring = 'roc_auc')
print ('Bias score:',np.mean(score_gnb))
print('Variance error:',np.std(score_gnb,ddof=1))
Bias score: 0.5467743430553775
Variance error: 0.04075739621955529
# K Neighbors Classifier
knn = KNeighborsClassifier()
knn.fit (x_scaled, ytrain)
pred_knn = knn.predict(xtest_scaled)
print (classification_report (ytest, pred_knn))
precision recall f1-score support
0 0.88 0.97 0.92 3115
1 0.61 0.27 0.37 584
accuracy 0.86 3699
macro avg 0.74 0.62 0.65 3699
weighted avg 0.83 0.86 0.83 3699
kf = KFold (n_splits =5, shuffle = True, random_state = 10)
score_knn = cross_val_score (knn, x_scaled, ytrain,cv=kf,scoring = 'roc_auc')
print ('Bias score:',np.mean(score_knn))
print('Variance error:',np.std(score_knn,ddof=1))
Bias score: 0.7863411273856273
Variance error: 0.014689176509020463
# Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=10)
dt.fit(x_scaled, ytrain)
pred_dt = dt.predict(xtest_scaled)
print (classification_report (ytest, pred_dt))
precision recall f1-score support
0 0.92 0.92 0.92 3115
1 0.56 0.57 0.57 584
accuracy 0.86 3699
macro avg 0.74 0.74 0.74 3699
weighted avg 0.86 0.86 0.86 3699
kf = KFold (n_splits =5, shuffle = True, random_state = 10)
score_dt = cross_val_score (dt, x_scaled, ytrain,cv=kf, scoring = 'roc_auc')
print ('Bias score:',np.mean(score_dt))
print('Variance error:',np.std(score_dt,ddof=1))
Bias score: 0.7327562971929138
Variance error: 0.014080161633066409
# Random Forest Classifier
rf = RandomForestClassifier()
rf.fit(x_scaled, ytrain)
pred_rf = rf.predict(xtest_scaled)
print (classification_report (ytest, pred_rf))
precision recall f1-score support
0 0.91 0.97 0.94 3115
1 0.79 0.51 0.62 584
accuracy 0.90 3699
macro avg 0.85 0.74 0.78 3699
weighted avg 0.89 0.90 0.89 3699
kf = KFold (n_splits =5, shuffle = True, random_state = 10)
score_rf = cross_val_score(rf, x_scaled, ytrain,cv=kf,scoring='roc_auc')
print ('Bias score:',np.mean(score_rf))
print('Variance error:',np.std(score_rf,ddof=1))
Bias score: 0.9245884870807393
Variance error: 0.0058387480743214894
models=[]
models.append(('Logistic',lr))
models.append(('NaiveBayes',gnb))
models.append(('KNN',knn))
models.append(('DecisionTree',dt))
models.append(('RandomForest',rf))
data=[]
data1=[]
results=[]
names=[]
for name , model in models:
kfold =KFold (n_splits =5, shuffle = True, random_state = 10)
cv_results=cross_val_score(model,x_scaled,ytrain,cv=kfold,scoring='roc_auc')
results.append(cv_results)
names.append(name)
data.append(np.mean(cv_results))
data1.append(np.std(cv_results,ddof=1))
a = pd.DataFrame({'Bias score':data,'Variance error':data1},index=names)
a.head()
| | Bias score | Variance error |
|---|---|---|
| Logistic | 0.911256 | 0.013996 |
| NaiveBayes | 0.567484 | 0.076740 |
| KNN | 0.783732 | 0.018745 |
| DecisionTree | 0.735969 | 0.029659 |
| RandomForest | 0.925249 | 0.007630 |
# BY THE BASE MODEL EVALUATION RANDOM FOREST CLASSIFIER IS PERFORMING BETTER THAN OTHER MODELS
sms=SMOTE(random_state=10,sampling_strategy=0.7)
xtrain_sm,ytrain_sm=sms.fit_resample(x_scaled,ytrain)
print(xtrain_sm.shape)
print(ytrain_sm.shape)
(12421, 68)
(12421,)
xtrain.shape
(8631, 68)
ytrain.shape
(8631,)
sns.countplot(ytrain_sm)
<AxesSubplot:xlabel='Revenue', ylabel='count'>
#logistic Regression
lrsm = LogisticRegression()
lrsm.fit(xtrain_sm, ytrain_sm)
pred_lrsm = lrsm.predict(xtest_scaled)
print (classification_report(ytest, pred_lrsm))
precision recall f1-score support
0 0.96 0.89 0.92 3115
1 0.58 0.79 0.67 584
accuracy 0.88 3699
macro avg 0.77 0.84 0.80 3699
weighted avg 0.90 0.88 0.88 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_lrsm = cross_val_score (lrsm, xtrain_sm, ytrain_sm,cv=kf, scoring = 'roc_auc')
print ('Bias score:',np.mean(score_lrsm))
print ('Variance score:',np.std(score_lrsm,ddof=1))
Bias score: 0.9335101941587208
Variance score: 0.004279587267516743
#Naive Bayes
gnbsm = GaussianNB()
gnbsm.fit (xtrain_sm,ytrain_sm)
pred_gnbsm = gnbsm.predict(xtest_scaled)
print (classification_report (ytest, pred_gnbsm))
precision recall f1-score support
0 0.96 0.04 0.08 3115
1 0.16 0.99 0.28 584
accuracy 0.19 3699
macro avg 0.56 0.52 0.18 3699
weighted avg 0.84 0.19 0.11 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_gnbsm = cross_val_score (gnbsm, xtrain_sm, ytrain_sm,cv=kf, scoring = 'roc_auc')
print ('Bias score:',np.mean(score_gnbsm))
print('Variance:',np.std(score_gnbsm,ddof=1))
Bias score: 0.5973383733592823
Variance: 0.05822306927651666
#KNN
knnsm = KNeighborsClassifier()
knnsm.fit (xtrain_sm, ytrain_sm)
pred_knnsm = knnsm.predict(xtest_scaled)
print (classification_report (ytest, pred_knnsm))
precision recall f1-score support
0 0.93 0.81 0.87 3115
1 0.41 0.68 0.51 584
accuracy 0.79 3699
macro avg 0.67 0.75 0.69 3699
weighted avg 0.85 0.79 0.81 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_knnsm = cross_val_score (knnsm, xtrain_sm, ytrain_sm,cv=kf,scoring = 'roc_auc')
print ('Bias score:',np.mean(score_knnsm))
print('Variance:',np.std(score_knnsm,ddof=1))
Bias score: 0.9429169928693
Variance: 0.005659173747841147
#DECISION TREE
dtsm = DecisionTreeClassifier(random_state=10)
dtsm.fit(xtrain_sm, ytrain_sm)
pred_dtsm = dtsm.predict(xtest_scaled)
print (classification_report (ytest, pred_dtsm))
precision recall f1-score support
0 0.92 0.90 0.91 3115
1 0.53 0.59 0.56 584
accuracy 0.85 3699
macro avg 0.72 0.74 0.73 3699
weighted avg 0.86 0.85 0.85 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_dtsm = cross_val_score (dtsm, xtrain_sm, ytrain_sm,cv=kf, scoring = 'roc_auc')
print ('Bias score:',np.mean(score_dtsm))
print('Variance:',np.std(score_dtsm,ddof=1))
Bias score: 0.8836800429341178
Variance: 0.006156145681644643
#RandomForestClassifier
rfsm = RandomForestClassifier()
rfsm.fit(xtrain_sm, ytrain_sm)
pred_rfsm = rfsm.predict(xtest_scaled)
print (classification_report (ytest, pred_rfsm))
precision recall f1-score support
0 0.94 0.94 0.94 3115
1 0.67 0.66 0.67 584
accuracy 0.90 3699
macro avg 0.80 0.80 0.80 3699
weighted avg 0.90 0.90 0.90 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_rfsm = cross_val_score(rfsm, xtrain_sm, ytrain_sm,cv=kf, scoring='roc_auc')
print ('Bias score:',np.mean(score_rfsm))
print('Variance:',np.std(score_rfsm,ddof=1))
Bias score: 0.9823153505549641
Variance: 0.0015935888226316253
#AFTER HANDLING CLASS IMBALANCE WITH SMOTE, THE BASE MODELS PERFORM BETTER
modelssm=[]
modelssm.append(('Logistic1',lrsm))
modelssm.append(('NaiveBayes1',gnbsm))
modelssm.append(('KNN1',knnsm))
modelssm.append(('DecisionTree1',dtsm))
modelssm.append(('RandomForest1',rfsm))
datasm=[]
data1sm=[]
resultssm=[]
namessm=[]
for namesm , model in modelssm:
kf =KFold (n_splits =5, shuffle = True, random_state = 10)
cv_resultssm=cross_val_score(model,xtrain_sm,ytrain_sm,cv=kf,scoring='roc_auc')
resultssm.append(cv_resultssm)
namessm.append(namesm)
datasm.append(np.mean(cv_resultssm))
data1sm.append(np.std(cv_resultssm,ddof=1))
a1 = pd.DataFrame({'Bias score':datasm,'Variance error':data1sm},index=namessm)
a1.head()
| | Bias score | Variance error |
|---|---|---|
| Logistic1 | 0.933510 | 0.004280 |
| NaiveBayes1 | 0.597338 | 0.058223 |
| KNN1 | 0.942917 | 0.005659 |
| DecisionTree1 | 0.883680 | 0.006156 |
| RandomForest1 | 0.982146 | 0.001720 |
Decision Tree TUNING
#Decision Tree Tuning
params={'criterion':['entropy','gini'],'min_samples_split':range(2,10),'max_depth':range(2,10)}
kf=KFold(5,random_state=10,shuffle=True)
gs=GridSearchCV(dtsm,cv=kf,param_grid=params,scoring='roc_auc')
gs.fit(xtrain_sm,ytrain_sm)
gs.best_params_
{'criterion': 'entropy', 'max_depth': 7, 'min_samples_split': 8}
dtsm_Tuned = DecisionTreeClassifier(criterion='entropy', max_depth= 7, min_samples_split=8)
dtsm_Tuned.fit(xtrain_sm, ytrain_sm)
pred_dtsm_tuned = dtsm_Tuned.predict(xtest_scaled)
print (classification_report (ytest, pred_dtsm_tuned))
precision recall f1-score support
0 0.95 0.90 0.92 3115
1 0.59 0.75 0.66 584
accuracy 0.88 3699
macro avg 0.77 0.83 0.79 3699
weighted avg 0.89 0.88 0.88 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_dtsm_Tuned = cross_val_score (dtsm_Tuned, xtrain_sm, ytrain_sm,cv=kf, scoring = 'roc_auc')
print ('Bias score:',np.mean(score_dtsm_Tuned))
print('Variance:',np.std(score_dtsm_Tuned,ddof=1))
Bias score: 0.9538959655410479
Variance: 0.0021851944518763576
Tuning the Decision Tree increases the bias score (mean CV ROC-AUC) and decreases the variance error
KNN TUNING
params = {'n_neighbors':np.arange(1,25,2),'metric':['hamming','euclidean','manhattan','chebyshev']}
kf = KFold(n_splits=5,shuffle=True,random_state=10)
GS = GridSearchCV(estimator=knn,param_grid=params,cv=kf,scoring='roc_auc')
GS.fit(xtrain_sm,ytrain_sm)
GS.best_params_
{'metric': 'manhattan', 'n_neighbors': 5}
#KNNTuned
knnsm_Tuned = KNeighborsClassifier(n_neighbors=5,metric='manhattan')
knnsm_Tuned.fit (xtrain_sm, ytrain_sm)
pred_knnsm_Tuned = knnsm_Tuned.predict(xtest_scaled)
print (classification_report (ytest, pred_knnsm_Tuned))
precision recall f1-score support
0 0.93 0.81 0.87 3115
1 0.41 0.68 0.51 584
accuracy 0.79 3699
macro avg 0.67 0.75 0.69 3699
weighted avg 0.85 0.79 0.81 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_knnsm_Tuned = cross_val_score (knnsm_Tuned, xtrain_sm, ytrain_sm,cv=kf,scoring = 'roc_auc')
print ('Bias score:',np.mean(score_knnsm_Tuned))
print('Variance:',np.std(score_knnsm_Tuned,ddof=1))
Bias score: 0.9555412217657102
Variance: 0.0030833965844697586
Tuning the KNN increases the bias score and decreases the variance error
Random Forest TUNING
params=[{'criterion':['entropy','gini'],
 'n_estimators':[90,100,150,200],
 'max_depth':[10,15,20],
 'min_samples_split':[2,5,8]}]
kf = KFold(n_splits=5,shuffle=True,random_state=10)
GS = GridSearchCV(estimator=rfsm,param_grid=params,cv=kf,scoring='roc_auc')
GS.fit(xtrain_sm,ytrain_sm)
GS.best_params_
{'criterion': 'entropy',
'max_depth': 20,
'min_samples_split': 2,
'n_estimators': 200}
#RandomForestClassifier
rfsm_Tuned = RandomForestClassifier(criterion='entropy',max_depth=20,min_samples_split=2,n_estimators=200)
rfsm_Tuned.fit(xtrain_sm, ytrain_sm)
pred_rfsm_Tuned = rfsm_Tuned.predict(xtest_scaled)
print (classification_report (ytest, pred_rfsm_Tuned))
precision recall f1-score support
0 0.94 0.94 0.94 3115
1 0.67 0.69 0.68 584
accuracy 0.90 3699
macro avg 0.81 0.81 0.81 3699
weighted avg 0.90 0.90 0.90 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_rfsm_Tuned = cross_val_score(rfsm_Tuned, xtrain_sm, ytrain_sm,cv=kf, scoring='roc_auc')
print ('Bias score:',np.mean(score_rfsm_Tuned))
print('Variance:',np.std(score_rfsm_Tuned,ddof=1))
Bias score: 0.9821150660138269
Variance: 0.001518579355933902
Tuning the Random Forest leaves the bias score essentially unchanged while slightly decreasing the variance error
#TO TUNE THE LOGISTIC REGRESSION WE HAVE TO DO FEATURE SELECTION.
rfe=RFE(estimator=lr,n_features_to_select=5)
rfe.fit(xtrain_sm,ytrain_sm)
RFE(estimator=LogisticRegression(), n_features_to_select=5)
feat= pd.Series(data= rfe.ranking_, index= xtrain_sm.columns)
sig= feat[feat==1].index
sig
Index(['BounceRates', 'PageValues', 'Month_Nov', 'TrafficType_15',
'VisitorType_Returning_Visitor'],
dtype='object')
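RFE works by repeatedly fitting the estimator and discarding the weakest features until the requested number remain; features that survive get `ranking_ == 1`, which is also exposed as the boolean `support_` mask. A minimal sketch on synthetic data (the column names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data with hypothetical column names f0..f7.
X, y = make_classification(n_samples=300, n_features=8, random_state=10)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(8)])

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
selected = X.columns[rfe.support_]  # equivalent to feat[feat == 1].index above
print(list(selected))
```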
## backward elimination (forward=False)
kf=KFold(n_splits=5,shuffle=True,random_state=10)
fl=sfs(estimator=lr,k_features='best',scoring='roc_auc',cv=kf,forward=False)
fl.fit(xtrain_sm,ytrain_sm)
be=list(fl.k_feature_names_)
be
['Administrative', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'ExitRates', 'PageValues', 'Month_Dec', 'Month_Feb', 'Month_Jul', 'Month_Mar', 'Month_May', 'Month_Nov', 'OperatingSystems_2', 'OperatingSystems_6', 'OperatingSystems_7', 'OperatingSystems_8', 'Browser_3', 'Browser_6', 'Browser_10', 'Browser_12', 'Region_2', 'Region_3', 'Region_4', 'Region_5', 'Region_7', 'Region_9', 'TrafficType_2', 'TrafficType_3', 'TrafficType_4', 'TrafficType_5', 'TrafficType_8', 'TrafficType_10', 'TrafficType_11', 'TrafficType_13', 'TrafficType_15', 'TrafficType_16', 'TrafficType_20', 'VisitorType_Other', 'VisitorType_Returning_Visitor']
# forward selection (forward=True)
f_ele=sfs(estimator=lr,k_features='best',forward=True,cv=kf,scoring='roc_auc')
f_ele.fit(xtrain_sm,ytrain_sm)
fe1=list(f_ele.k_feature_names_)
fe1
['Administrative', 'ProductRelated', 'ProductRelated_Duration', 'ExitRates', 'PageValues', 'Month_Dec', 'Month_Feb', 'Month_Jul', 'Month_Mar', 'Month_May', 'Month_Nov', 'Month_Sep', 'OperatingSystems_2', 'OperatingSystems_4', 'OperatingSystems_6', 'OperatingSystems_7', 'OperatingSystems_8', 'Browser_2', 'Browser_3', 'Browser_4', 'Browser_6', 'Browser_9', 'Browser_10', 'Browser_12', 'Region_2', 'Region_3', 'Region_4', 'Region_5', 'Region_7', 'Region_9', 'TrafficType_2', 'TrafficType_3', 'TrafficType_4', 'TrafficType_5', 'TrafficType_6', 'TrafficType_7', 'TrafficType_8', 'TrafficType_10', 'TrafficType_11', 'TrafficType_12', 'TrafficType_13', 'TrafficType_15', 'TrafficType_16', 'TrafficType_17', 'TrafficType_18', 'TrafficType_20', 'VisitorType_Other', 'VisitorType_Returning_Visitor']
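The `sfs` used above is mlxtend's `SequentialFeatureSelector` with `k_features='best'`. scikit-learn ships a similar built-in selector; unlike mlxtend's, it needs a fixed number of features rather than searching for the best subset size. A minimal sketch of the sklearn variant on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Toy data; direction='forward' adds features greedily, 'backward' removes.
X, y = make_classification(n_samples=300, n_features=8, random_state=10)
kf = KFold(n_splits=5, shuffle=True, random_state=10)
sel = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction='forward',
                                scoring='roc_auc', cv=kf)
sel.fit(X, y)
print(sel.get_support())  # boolean mask of the 3 chosen features
```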
#Building Tuned Logistic Regression Model
def perf2(model,belist):
model.fit(xtrain_sm[belist],ytrain_sm)
ypred_test_Tuned=model.predict(xtest_scaled[belist])
kf=KFold(n_splits=5,shuffle=True,random_state=10)
scores=cross_val_score(model,xtrain_sm[belist],ytrain_sm,cv=kf,scoring='roc_auc')
print(scores,'\n')
print('average scores',np.mean(scores))
print('bias score',np.mean(scores))
print('variance score',np.std(scores,ddof=1))
print('\nclassification report for test data')
print(classification_report(ytest,ypred_test_Tuned))
#Recursive Feature Elimination
lr=LogisticRegression()
perf2(lr,sig)
[0.92585213 0.93262592 0.92490698 0.92134672 0.93277162]
average scores 0.9275006750632185
bias score 0.9275006750632185
variance score 0.005034063792660561
classification report for test data
precision recall f1-score support
0 0.96 0.89 0.92 3115
1 0.57 0.78 0.66 584
accuracy 0.87 3699
macro avg 0.76 0.83 0.79 3699
weighted avg 0.90 0.87 0.88 3699
#BACKWARD ELIMINATION
perf2(lr,be)
[0.93223434 0.93766826 0.93355854 0.929515 0.93915462]
average scores 0.9344261517510496
bias score 0.9344261517510496
variance score 0.003954295365617345
classification report for test data
precision recall f1-score support
0 0.96 0.89 0.92 3115
1 0.57 0.78 0.66 584
accuracy 0.87 3699
macro avg 0.76 0.84 0.79 3699
weighted avg 0.90 0.87 0.88 3699
#FORWARD SELECTION
perf2(lr,fe1)
[0.93175282 0.93749389 0.93353788 0.92930986 0.93942831]
average scores 0.9343045529486173
bias score 0.9343045529486173
variance score 0.004137355656329273
classification report for test data
precision recall f1-score support
0 0.96 0.89 0.92 3115
1 0.57 0.78 0.66 584
accuracy 0.87 3699
macro avg 0.76 0.84 0.79 3699
weighted avg 0.90 0.87 0.88 3699
The backward-eliminated feature set gives the best cross-validation result for the logistic regression model
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
AB=AdaBoostClassifier(random_state=10)
params={'n_estimators':np.arange(1,100)}
kf=KFold(n_splits=5,shuffle=True,random_state=10)
gs=GridSearchCV(AB,params,cv=kf,scoring='roc_auc')
gs.fit(xtrain_sm,ytrain_sm)
gs.best_params_
{'n_estimators': 99}
AB=AdaBoostClassifier(n_estimators=99,random_state=10)
AB.fit(xtrain_sm, ytrain_sm)
pred_AB = AB.predict(xtest_scaled)
print (classification_report (ytest, pred_AB))
precision recall f1-score support
0 0.94 0.92 0.93 3115
1 0.63 0.69 0.66 584
accuracy 0.89 3699
macro avg 0.79 0.81 0.80 3699
weighted avg 0.89 0.89 0.89 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_AB = cross_val_score(AB, xtrain_sm, ytrain_sm,cv=kf, scoring='roc_auc')
print ('Bias score:',np.mean(score_AB))
print('Variance error:',np.std(score_AB,ddof=1))
Bias score: 0.9662572960598312
Variance error: 0.002853757720574624
#the AdaBoosted model gives a better bias score than our own tuned Decision Tree model
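When no `base_estimator` is given, `AdaBoostClassifier` boosts depth-1 decision stumps, so the `n_estimators` grid above is effectively searching over the number of stumps. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy data; the default base learner is a max_depth=1 decision tree.
X, y = make_classification(n_samples=500, random_state=10)
kf = KFold(n_splits=5, shuffle=True, random_state=10)
ab = AdaBoostClassifier(n_estimators=50, random_state=10)
scores = cross_val_score(ab, X, y, cv=kf, scoring='roc_auc')
print(scores.mean(), scores.std(ddof=1))
```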
#boosting base Logistic regression model
AB=AdaBoostClassifier(base_estimator=lrsm,random_state=10)
params={'n_estimators':np.arange(1,100)}
kf=KFold(n_splits=5,shuffle=True,random_state=10)
gs=GridSearchCV(AB,params,cv=kf,scoring='roc_auc')
gs.fit(xtrain_sm,ytrain_sm)
gs.best_params_
{'n_estimators': 38}
#ada boosted logistic regression
ABLR=AdaBoostClassifier(base_estimator=lrsm,n_estimators=38,random_state=10)
ABLR.fit(xtrain_sm, ytrain_sm)
pred_ABLR = ABLR.predict(xtest_scaled)
print (classification_report (ytest, pred_ABLR))
precision recall f1-score support
0 0.95 0.90 0.92 3115
1 0.58 0.76 0.66 584
accuracy 0.88 3699
macro avg 0.77 0.83 0.79 3699
weighted avg 0.89 0.88 0.88 3699
kf=KFold (n_splits =5, shuffle = True, random_state = 10)
score_ABLR = cross_val_score (ABLR, xtrain_sm, ytrain_sm,cv=kf, scoring = 'roc_auc')
print ('Bias score:',np.mean(score_ABLR))
print ('Variance score:',np.std(score_ABLR,ddof=1))
Bias score: 0.9333297124266646
Variance score: 0.004002160496831838
#boosting clearly improves the bias score of the Decision Tree, while the boosted Logistic Regression stays close to its base model; the Random Forest still performs best
#now we are going to combine the top two models
modelss=[]
modelss.append(('Logistic',lrsm))
modelss.append(('NaiveBayes',gnbsm))
modelss.append(('KNN',knnsm))
modelss.append(('DecisionTree',dtsm))
modelss.append(('RandomForest',rfsm))
modelss.append(('decision_tree_tuned',dtsm_Tuned))
modelss.append(('KNN_tuned',knnsm_Tuned))
modelss.append(('Randomforest_Tuned',rfsm_Tuned))
modelss.append(('AdaboostedDT',AB))
modelss.append(('AdaboostedlogisticRegression',ABLR))
dataa=[]
dataa1=[]
resultss=[]
namess=[]
for name , model in modelss:
kfold =KFold (n_splits =5, shuffle = True, random_state = 10)
cv_resultss=cross_val_score(model,xtrain_sm,ytrain_sm,cv=kfold,scoring='roc_auc')
resultss.append(cv_resultss)
namess.append(name)
dataa.append(np.mean(cv_resultss))
dataa1.append(np.std(cv_resultss,ddof=1))
az = pd.DataFrame({'Bias score':dataa,'Variance error':dataa1},index=namess)
az.head(10)
| | Bias score | Variance error |
|---|---|---|
| Logistic | 0.934042 | 0.007397 |
| NaiveBayes | 0.588860 | 0.061882 |
| KNN | 0.942815 | 0.013436 |
| DecisionTree | 0.882065 | 0.047925 |
| RandomForest | 0.982866 | 0.017025 |
| decision_tree_tuned | 0.950372 | 0.023677 |
| KNN_tuned | 0.957399 | 0.010439 |
| Randomforest_Tuned | 0.982510 | 0.017283 |
| AdaboostedDT | 0.964561 | 0.031178 |
| AdaboostedlogisticRegression | 0.933734 | 0.007401 |
#among these, the AdaBoosted DT and the Random Forest are the best models, so we combine these two in a soft-voting ensemble
from sklearn.ensemble import VotingClassifier
Stacked = VotingClassifier(estimators = [('AdaboostedDT',AB),
('RandomForest',rfsm)],voting='soft')
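With `voting='soft'`, the ensemble averages the members' predicted class probabilities and takes the argmax, rather than counting hard class votes. A minimal sketch on synthetic data verifying that the ensemble probability is the plain mean of the members' probabilities (with equal weights):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Toy data; two members standing in for the AdaBoosted DT and Random Forest.
X, y = make_classification(n_samples=300, random_state=10)
members = [('lr', LogisticRegression(max_iter=1000)),
           ('rf', RandomForestClassifier(random_state=10))]
vote = VotingClassifier(estimators=members, voting='soft').fit(X, y)

# Soft voting with default weights is the mean of member probabilities.
avg = np.mean([m.predict_proba(X[:5]) for m in vote.estimators_], axis=0)
ens = vote.predict_proba(X[:5])
print(np.allclose(avg, ens))
```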
modelss=[]
modelss.append(('Logistic',lrsm))
modelss.append(('NaiveBayes',gnbsm))
modelss.append(('KNN',knnsm))
modelss.append(('DecisionTree',dtsm))
modelss.append(('RandomForest',rfsm))
modelss.append(('decision_tree_tuned',dtsm_Tuned))
modelss.append(('KNN_tuned',knnsm_Tuned))
modelss.append(('Randomforest_Tuned',rfsm_Tuned))
modelss.append(('AdaboostedDT',AB))
modelss.append(('AdaboostedlogisticRegression',ABLR))
modelss.append(('Stackedmodels',Stacked))
dataa=[]
dataa1=[]
resultss=[]
namess=[]
for name , model in modelss:
kfold =KFold (n_splits =5, shuffle = True, random_state = 10)
cv_resultss=cross_val_score(model,xtrain_sm,ytrain_sm,cv=kfold,scoring='roc_auc')
resultss.append(cv_resultss)
namess.append(name)
dataa.append(np.mean(cv_resultss))
dataa1.append(np.std(cv_resultss,ddof=1))
az = pd.DataFrame({'Bias score':dataa,'Variance error':dataa1},index=namess)
az.head(11)
| | Bias score | Variance error |
|---|---|---|
| Logistic | 0.934042 | 0.007397 |
| NaiveBayes | 0.588860 | 0.061882 |
| KNN | 0.942815 | 0.013436 |
| DecisionTree | 0.882065 | 0.047925 |
| RandomForest | 0.982568 | 0.017991 |
| decision_tree_tuned | 0.950428 | 0.024004 |
| KNN_tuned | 0.957399 | 0.010439 |
| Randomforest_Tuned | 0.982446 | 0.017489 |
| AdaboostedDT | 0.964561 | 0.031178 |
| AdaboostedlogisticRegression | 0.933734 | 0.007401 |
| Stackedmodels | 0.982836 | 0.017431 |
#among these, the stacked (soft-voting) model performs best and can be taken forward to deployment
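One way to move the chosen model toward deployment is to persist the fitted estimator with joblib and reload it at serving time. A minimal sketch (the file name is illustrative, and a plain classifier stands in for the fitted voting ensemble):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the fitted stacked/voting model.
X, y = make_classification(n_samples=200, random_state=10)
model = RandomForestClassifier(random_state=10).fit(X, y)

# Persist, then reload as a deployment service would.
joblib.dump(model, 'stacked_model.joblib')
restored = joblib.load('stacked_model.joblib')
print((restored.predict(X) == model.predict(X)).all())
```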